Purpose & objective of this analysis

The purpose of this project is to uncover and document valuable, actionable insights contained in the available source data, for the benefit of the target audience.

The target audience includes coaches, personal trainers, athletes, athletics governing body officials, interested members of the public, sports enthusiasts, sports statisticians and data scientists.

The objective is to explore Victorian interclub athletic competition results data for the complete 2017-18 season and identify:

  • Natural groupings & patterns within the data
  • Basic descriptive statistics across the entire dataset
    • How many competitions per athlete
    • How many events per athlete
    • Min, max, averages and quartiles for each event
    • Geographic statistics: for Victoria as a whole and by competition zone/region
  • Further opportunities for athletics data analysis
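
The descriptive statistics listed above could be computed along the following lines. This is only a minimal sketch on invented toy data: the data frame `toy_results` and its columns (`athlete_id`, `round`, `event`, `performance`) are hypothetical placeholders, not the variable names of the real dataset.

```r
library(dplyr)

# Toy data: names and values are invented for illustration only
toy_results <- data.frame(
  athlete_id  = c("A1", "A1", "A2", "A2", "A3", "A3", "A3"),
  round       = c(1, 2, 1, 1, 2, 3, 3),
  event       = c("100m", "100m", "100m", "Long Jump",
                  "Long Jump", "100m", "Long Jump"),
  performance = c(12.4, 12.1, 13.0, 5.2, 4.8, 12.8, 5.0),
  stringsAsFactors = FALSE
)

# Competitions (distinct rounds) and event entries per athlete
per_athlete <- toy_results %>%
  group_by(athlete_id) %>%
  summarise(n_rounds = n_distinct(round),
            n_events = n())

# Min, quartiles, mean and max of the performance for each event
per_event <- toy_results %>%
  group_by(event) %>%
  summarise(perf_min    = min(performance),
            perf_q1     = quantile(performance, 0.25, names = FALSE),
            perf_median = median(performance),
            perf_mean   = mean(performance),
            perf_q3     = quantile(performance, 0.75, names = FALSE),
            perf_max    = max(performance))

per_athlete
per_event
```

Grouping by an additional zone or region column would give the same statistics per geography, assuming such a column is present after the reference data is merged.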

Important note: This is an independent analysis conducted by Bree McLennan, using publicly available data from the Athletics Victoria website. This analysis does not represent the opinions of Athletics Victoria.

Key questions from the target audience to guide this analysis

  1. What are the participation rates at interclub competitions?
    • Can we break this down by zones and clubs?
    • How many athletes participated in all rounds of competition?
    • How many athletes competed at “away” venues?
  2. How many opportunities are there for each event type?
    • Can we see stats by event grouping?
  3. How many incomplete events or invalid event attempts occurred?
  4. Is there any pattern to performances as the season progresses?
  5. How many venues are involved throughout the season?
    • What’s the “windiest” venue?
  6. Can we see how points are distributed for performances?
    • What other alternatives to point scoring are there?

Context specific process flow for this analysis

  1. Define the parameters: Purpose, objective, and rough timelines
  2. Obtain target audience input
  3. Obtain source data: Athletics Victoria “AV Shield 2017-18”
  4. Conduct a risk assessment on source data with respect to purpose & objective
    • Discarding any data which is not relevant to the analysis guiding questions
  5. Technical setup to commence analysis:
    • Github repository
    • R Project file
      • Load data >> Prepare data >> Merge reference data >> Transform data >> Analyse data
  6. Explore data & key guiding questions to discover answers
  7. Peer review & publish analysis and findings
  8. Obtain audience feedback, review and apply updates where appropriate
    • Opportunities for subsequent analysis

Analysis data considerations

The data for interclub rounds 1 to 12 is contained in individual CSV files, one per round for each participating Victorian region.

General description of the source data:

  • Round 7 is excluded because it was cancelled due to extreme weather.
  • There are 77 individual CSV files for season 2017-18
  • Each CSV file contains the performance results for each athlete, by event completed, for one round of competition in one region during the 2017-18 season
  • There are 21 variables in the source data, including athlete registration ID, event specification, performance result, age group, club, venue, wind reading and completion status

Technical approach to creating the analysis data:

  • Append all CSVs together to create one source dataset
  • Re-name and format variables for data type consistency
  • Binarise variables where appropriate
  • Create hierarchical groupings for event types, venues, and age groups
  • Triage missing data (careful application of subject matter expertise)
    • Particularly with AWD classification performance adjustments, venue names, event specifications and event status (DNQ, INV)
  • Merge on reference data using created keys:
    • Club details (shortname, full name, zone)
    • Venue details (geographical location, track type)
    • Performance adjustment (AWD & masters age group athletes)
  • Calculate athlete finishing order per event and point scoring methods
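
Three of the steps above (appending the CSVs, merging reference data by key, and calculating finishing order) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used for this analysis: the file names, the columns (`athlete_id`, `club`, `event`, `performance`, `round`) and the reference table `club_ref` are all hypothetical, and the toy files are written to a temporary folder so the example runs on its own.

```r
library(dplyr)

# Write two toy CSV files to a fresh temporary folder.
# File names and columns are hypothetical, not the real AV Shield layout.
csv_dir <- tempfile("av_toy_")
dir.create(csv_dir)
write.csv(data.frame(athlete_id = c("A1", "A2"), club = c("CLUBA", "CLUBB"),
                     event = "100m", performance = c(12.4, 13.0), round = 1),
          file.path(csv_dir, "round01_regionX.csv"), row.names = FALSE)
write.csv(data.frame(athlete_id = c("A1", "A3"), club = c("CLUBA", "CLUBB"),
                     event = "100m", performance = c(12.1, 12.8), round = 2),
          file.path(csv_dir, "round02_regionX.csv"), row.names = FALSE)

# 1. Append all CSVs together to create one source dataset
csv_files   <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
all_results <- bind_rows(lapply(csv_files, read.csv, stringsAsFactors = FALSE))

# 2. Merge on reference data by key (here the key is the club short name)
club_ref <- data.frame(club           = c("CLUBA", "CLUBB"),
                       club_full_name = c("Toy Club A", "Toy Club B"),
                       zone           = c("Zone 1", "Zone 2"),
                       stringsAsFactors = FALSE)
merged <- left_join(all_results, club_ref, by = "club")

# 3. Finishing order per event within each round.
#    Lower is better for timed events; a field event would rank on
#    desc(performance) instead.
ranked <- merged %>%
  group_by(round, event) %>%
  mutate(finish_place = min_rank(performance)) %>%
  ungroup()

ranked
```

A left join keeps every result row even when a club has no match in the reference table, which makes unmatched keys visible as missing values rather than silently dropping results.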

Sample of the created analysis dataset

# Randomly sample 6 rows from the analysis dataset
# (the row index goes before the comma; without the comma, columns would be selected)
wrk.03DataTrans_03[sample(nrow(wrk.03DataTrans_03), 6), ]

Exploring the key questions

  1. What are the participation rates at interclub competitions?
    • Can we break this down by zones and clubs?
    • How many athletes participated in all rounds of competition?
    • How many athletes competed at “away” venues?
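# Participation rates
library(dplyr)
# "not in" helper; defined here in case it is not already set up earlier in the project
`%ni%` <- Negate(`%in%`)
wrk.03DataTrans_Q1A <- wrk.03DataTrans_03 %>%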
  filter(KEYRegistrationNumber %ni% c("0")) %>% #remove teams
  group_by(ORDCompetitionRound) %>%
  summarise(NUMAthletes_RV = n_distinct(KEYRegistrationNumber),
            NUMTotalEventParticipation = n())
wrk.03DataTrans_Q1A
plot(wrk.03DataTrans_Q1A)

#table(wrk.03DataTrans_03$ORDCompetitionRound, n_distinct(wrk.03DataTrans_03$CATAthleteRegisteredClub))
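
The breakdown by club asked for in this question could be approached by adding the club to the grouping. The sketch below uses a small invented data frame, `toy`, as a stand-in for `wrk.03DataTrans_03`. The column names are taken from the code in this section, but every value is made up for illustration.

```r
library(dplyr)

# Toy stand-in for wrk.03DataTrans_03; all values are invented
toy <- data.frame(
  ORDCompetitionRound      = c(1, 1, 1, 2, 2, 2),
  KEYRegistrationNumber    = c("101", "101", "102", "101", "103", "0"),
  CATAthleteRegisteredClub = c("Club A", "Club A", "Club B",
                               "Club A", "Club B", "Club B"),
  stringsAsFactors = FALSE
)

# Distinct athletes and total event entries per round and club
by_round_club <- toy %>%
  filter(KEYRegistrationNumber != "0") %>%  # remove teams, as in the round-level summary
  group_by(ORDCompetitionRound, CATAthleteRegisteredClub) %>%
  summarise(NUMAthletes = n_distinct(KEYRegistrationNumber),
            NUMTotalEventParticipation = n(),
            .groups = "drop")

by_round_club
```

The same pattern would apply to zones, assuming a zone column is available from the merged club reference data.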

# Athletes competing in all rounds of competition
wrk.03DataTrans_Q1B <- wrk.03DataTrans_03 %>%
  filter(KEYRegistrationNumber %ni% c("0")) %>% #remove teams
  group_by(KEYRegistrationNumber) %>%
  summarise(NUMAthletesRounds = n_distinct(ORDCompetitionRound)) %>%
  filter(NUMAthletesRounds >= 11) %>%
  summarise(NUMTotalAthletesAllRounds = n(),
            NUMRounds = max(NUMAthletesRounds))
wrk.03DataTrans_Q1B

# How many athletes competed at away venues?

wrk.03DataTrans_Q1C <- wrk.03DataTrans_03 %>%
  filter(KEYRegistrationNumber %ni% c("0")) %>% #remove teams
  group_by(KEYRegistrationNumber) %>%
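  summarise(NUMAthletesAway = sum(as.numeric(BINAthleteCompeteAwayVenue))) %>% # away event entries per athlete; assumes the flag is coded 0/1
  filter(NUMAthletesAway >= 1) # keep athletes with at least one away appearance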
wrk.03DataTrans_Q1C